Web archives and urban analytics


Emmanouil Tranos

University of Bristol, Alan Turing Institute
@EmmanouilTranos, etranos.info

Contents

  1. Material and immaterial regional interdependencies: using the web to predict regional trade flows
  2. Turing PhD project: Giulia Occhini


Web archives and the evolution of the digital economy

Material and immaterial regional interdependencies: using the web to predict regional trade flows



In collaboration with Andre Carrascal Incera & George Willis [work in progress]

Regional trade flows

  • Regions are more specialised and open than countries
  • Important external trade dependencies (Thissen et al. 2016)
  • Regions vary in terms of their specialisation patterns and, therefore, in their trade relationships and openness

Regional trade flows

  • Knowing and predicting regional trade helps to understand:
    • regional economic performance
    • exposure to external shocks
    • place-based development strategies
  • Employment vulnerability and the transmission of internal and external shocks differ across regions
  • Workers in US regions specialised in specific manufacturing industries were more vulnerable to the emergence of China (Autor et al. 2013)

Regional trade flows: hardly any data

  • Big caveat: interregional trade data are scarce
  • Europe: spatially disaggregated IO for NUTS2 regions (Thissen et al. 2018)
  • Costly, difficult exercise

Our contribution

  • Utilise the digital traces that interregional trade leaves behind
  • Model and predict interregional trade flows for the UK
  • Scrape open web data
  • Hyperlinks between commercial websites
  • Machine learning techniques for out-of-sample predictions
  • Hypothesis: such hyperlinks reflect business and trade relations

Web data and spatial research

Web data and businesses

  • Businesses may not expose all of their strategies on their websites, but neither do they in surveys (Arora et al. 2013)
  • Business websites:
    • spreading information
    • establishing a public image
    • supporting online transactions
    • sharing opinions

Empirical strategy

Web data: The Internet Archive

  • The largest archive of webpages in the world
  • 273 billion webpages from over 361 million websites, 15 petabytes of storage (1996 onwards)
  • A web crawler starts with a list of URLs (a seed list) to crawl and downloads a copy of their content
  • Using the hyperlinks included in the crawled URLs, new URLs are identified and crawled (snowball sampling)
  • Every archived copy carries a time-stamp
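
The crawling logic described above can be sketched in a few lines. This is a toy illustration only, not the Internet Archive's crawler: the function name `snowball_crawl`, the `get_links` callback and the in-memory `TOY_WEB` (with made-up URLs) are all assumptions standing in for real HTTP requests and HTML parsing.

```python
from collections import deque


def snowball_crawl(seed_urls, get_links, max_pages=1000):
    """Breadth-first crawl: visit the seeds, then follow discovered hyperlinks."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would store a time-stamped copy here
        for target in get_links(url):
            if target not in seen:  # only enqueue URLs not yet identified
                seen.add(target)
                queue.append(target)
    return visited


# Toy in-memory "web" (hypothetical URLs) used instead of real requests.
TOY_WEB = {
    "a.co.uk": ["b.co.uk", "c.co.uk"],
    "b.co.uk": ["c.co.uk", "d.co.uk"],
    "c.co.uk": ["a.co.uk"],
    "d.co.uk": [],
    "e.co.uk": ["a.co.uk"],  # nothing links to it, so it is never discovered
}

crawled = snowball_crawl(["a.co.uk"], lambda url: TOY_WEB.get(url, []))
print(crawled)  # ['a.co.uk', 'b.co.uk', 'c.co.uk', 'd.co.uk']
```

The website that no other page links to is never reached, which illustrates why coverage of a snowball sample depends on the seed list.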

Our web data

  • JISC UK Web Domain Dataset: all archived webpages from the .uk domain 1996-2010
  • Curated by the British Library
  • Tranos, E., and C. Stich. 2020. Individual internet usage and the availability of online content of local interest: A multilevel approach. Computers, Environment and Urban Systems 79:101371
  • Tranos, E., T. Kitsos, and R. Ortega-Argilés. 2021. Digital economy in the UK: Regional productivity effects of early adoption. Regional Studies. Forthcoming

Our web data

  1. Geoindex: a subset of the .uk archived webpages which contain a UK postcode
  2. Hyperlinks
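
A minimal sketch of how the two ingredients could be combined: postcodes geolocate websites, and hyperlinks between geolocated websites are aggregated to region pairs. Everything named here is an assumption for illustration: the simplified postcode pattern `POSTCODE_RE` (it does not cover every special case), the lookup `AREA_TO_REGION`, the example pages and links, and the choice to geolocate only websites with exactly one postcode.

```python
import re
from collections import Counter

# Simplified UK postcode pattern (assumption; not the full official format).
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}\b")

# Hypothetical lookup from postcode area (leading letters) to region.
AREA_TO_REGION = {"BS": "South West", "M": "North West", "EC": "London"}


def find_postcodes(text):
    """Return the distinct postcodes in a page's text, with whitespace removed."""
    return {re.sub(r"\s", "", m.group(0)) for m in POSTCODE_RE.finditer(text.upper())}


def site_regions(pages):
    """Assign a region to every website that mentions exactly one postcode."""
    regions = {}
    for site, text in pages.items():
        postcodes = find_postcodes(text)
        if len(postcodes) == 1:
            area = re.match(r"[A-Z]{1,2}", next(iter(postcodes))).group(0)
            if area in AREA_TO_REGION:
                regions[site] = AREA_TO_REGION[area]
    return regions


def region_link_counts(links, regions):
    """Count hyperlinks between origin and destination regions."""
    counts = Counter()
    for source, target in links:
        if source in regions and target in regions:
            counts[(regions[source], regions[target])] += 1
    return counts


# Made-up example pages and hyperlinks.
PAGES = {
    "shop.example.co.uk": "Visit us at 1 High Street, Bristol BS1 4DJ",
    "supplier.example.co.uk": "Head office: Manchester M1 1AE",
    "directory.example.co.uk": "Listings: BS1 4DJ, M1 1AE, EC1A 1BB",
}
LINKS = [
    ("shop.example.co.uk", "supplier.example.co.uk"),
    ("shop.example.co.uk", "supplier.example.co.uk"),
    ("supplier.example.co.uk", "shop.example.co.uk"),
    ("directory.example.co.uk", "shop.example.co.uk"),
]

regions = site_regions(PAGES)
flows = region_link_counts(LINKS, regions)
print(dict(flows))
```

The directory-style page mentions several postcodes, so it receives no region and its outgoing link is dropped from the counts.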

Modelling strategy

\[trade_{ijt} \sim hyperlinks_{ijt} + distance_{ij} + \\ pop.density_{it} + pop.density_{jt} + empl_{it} + empl_{jt}\]

  • Predict inter-regional trade flows using Random Forests (RF)
  • Tree-based ensemble learning method (Breiman 2001)
  • Widely used both for regression and classification problems (Biau 2012)
  • Short training time (Caruana, Karampatziakis, and Yessenalina 2008; Liaw, Wiener, and others 2002; Yan, Liu, and Zhao 2020)

Modelling strategy: rolling forecasting

  • Train RF models on data from years \(t\) and \(t + 1\) to increase the size of the training dataset
  • 10-fold cross validation
  • Predict unseen data from year \(t + 2\)
  • No pooling of data across all years, so that the temporal structure is maintained, for both methodological and conceptual reasons
  • No data leakage
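
The rolling scheme above can be written down as a small helper that lists the training and test years for each model. The function name `rolling_windows` is hypothetical, and the sketch assumes consecutive annual data for 2000 to 2010.

```python
def rolling_windows(years, train_span=2):
    """Pair each block of `train_span` consecutive years with the following year."""
    years = sorted(years)
    windows = []
    for start in range(len(years) - train_span):
        train_years = years[start:start + train_span]
        test_year = years[start + train_span]
        windows.append((train_years, test_year))
    return windows


windows = rolling_windows(range(2000, 2011))
for train_years, test_year in windows:
    # The test year is always later than every training year: no leakage in time.
    assert test_year > max(train_years)
    print(train_years, "->", test_year)
```

Under these assumptions the first model trains on 2000 and 2001 and is tested on 2002, and the last trains on 2008 and 2009 and is tested on 2010. Any cross-validation for tuning would then take place inside each training window only.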

Modelling strategy: predictive performance

\[\begin{align} R^2 = 1 - \frac{\sum_{k = 1}^{N} (y_{k} - \hat{y}_{k})^2} {\sum_{k = 1}^{N} (y_{k} - \overline{y})^2} \label{eq:rsquared} \end{align}\]

\[\begin{align} MAE = \frac{1}{N} \sum_{k = 1}^{N} |\hat{y}_{k} - y_{k}| \label{eq:mae} \end{align}\]

\[\begin{align} RMSE = \sqrt{\frac{\sum_{k = 1}^{N} (\hat{y}_{k} - y_{k})^2} {N}} \label{eq:rmse} \end{align}\]

  • \(RMSE\) squares the errors before averaging, so larger errors carry more weight than in \(MAE\)
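
A small worked example of the three metrics, using made-up numbers. The two sets of predictions have the same \(MAE\), but the one with a single large error has a higher \(RMSE\). The function names are illustrative.

```python
import math


def r_squared(y, y_hat):
    """1 minus the ratio of residual to total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1 - ss_res / ss_tot


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(pred - obs) for obs, pred in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((pred - obs) ** 2 for obs, pred in zip(y, y_hat)) / len(y))


y = [10.0, 20.0, 30.0, 40.0]      # observed values (made up)
even = [12.0, 18.0, 32.0, 38.0]   # four errors of size 2
spiky = [10.0, 20.0, 30.0, 48.0]  # a single error of size 8

print(mae(y, even), mae(y, spiky))    # 2.0 2.0
print(rmse(y, even), rmse(y, spiky))  # 2.0 4.0
print(round(r_squared(y, even), 3), round(r_squared(y, spiky), 3))  # 0.968 0.872
```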

Data cleaning

Frequency of unique postcodes per website, 2000

Postcodes per website   Websites   Share   Cum. websites   Cum. share
(0,1]                      41596   0.718           41596        0.718
(1,2]                       6451   0.111           48047        0.830
(2,10]                      6163   0.106           54210        0.936
(10,100]                    2975   0.051           57185        0.988
(100,1000]                   646   0.011           57831        0.999
(1000,10000]                  62   0.001           57893        1.000
(10000,100000]                 4   0.000           57897        1.000

  • Websites with a large number of postcodes: e.g. directories, real estate websites
  • Websites with a unique location \(\Leftarrow\) The focus of analysis for now
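
A sketch of how such a frequency table and the unique-location filter could be produced. It assumes the input is a mapping from website to its number of distinct postcodes; the bin edges mirror the table above, while the website names and counts are invented.

```python
from collections import Counter

# Half-open bins (low, high], mirroring the intervals in the table above.
BINS = [(0, 1), (1, 2), (2, 10), (10, 100), (100, 1000), (1000, 10000), (10000, 100000)]


def bin_label(n_postcodes):
    """Return the label of the bin that contains n_postcodes."""
    for low, high in BINS:
        if low < n_postcodes <= high:
            return f"({low},{high}]"
    raise ValueError(f"{n_postcodes} falls outside all bins")


def frequency_table(postcodes_per_site):
    """Rows of (bin, frequency, share, cumulative frequency, cumulative share)."""
    counts = Counter(bin_label(n) for n in postcodes_per_site.values())
    total = sum(counts.values())
    rows, cumulative = [], 0
    for low, high in BINS:
        label = f"({low},{high}]"
        freq = counts.get(label, 0)
        cumulative += freq
        rows.append((label, freq, round(freq / total, 3), cumulative, round(cumulative / total, 3)))
    return rows


# Made-up counts of distinct postcodes per website.
POSTCODES_PER_SITE = {
    "bakery.example": 1,
    "garage.example": 1,
    "florist.example": 1,
    "chain.example": 2,
    "agency.example": 7,
    "directory.example": 250,
}

table = frequency_table(POSTCODES_PER_SITE)
unique_location_sites = sorted(s for s, n in POSTCODES_PER_SITE.items() if n == 1)

for row in table:
    print(row)
print(unique_location_sites)
```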

Directory website with a lot of postcodes

Website with a unique postcode in London

Descriptive statistics

Interregional trade flows

Correlations with interregional trade

year hyperlinks distance
2000 0.539 -0.219
2001 0.578 -0.221
2002 0.793 -0.221
2003 0.483 -0.220
2004 0.807 -0.223
2005 0.643 -0.219
2006 0.585 -0.219
2007 0.598 -0.214
2008 0.491 -0.205
2009 0.922 -0.207
2010 0.674 -0.205
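
Yearly correlations of this kind can be computed as below. The sketch assumes Pearson's correlation coefficient (the slides do not state which coefficient is reported) and uses invented region-pair rows, so the numbers it prints are unrelated to the table above.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Made-up rows: (year, trade, hyperlinks, distance) for region pairs.
ROWS = [
    (2000, 100.0, 50, 50.0),
    (2000, 80.0, 35, 120.0),
    (2000, 30.0, 20, 300.0),
    (2000, 10.0, 5, 450.0),
    (2001, 110.0, 40, 50.0),
    (2001, 70.0, 45, 120.0),
    (2001, 35.0, 10, 300.0),
    (2001, 15.0, 8, 450.0),
]


def yearly_correlations(rows):
    """Map each year to (corr(trade, hyperlinks), corr(trade, distance))."""
    by_year = {}
    for year, trade, links, dist in rows:
        by_year.setdefault(year, []).append((trade, links, dist))
    result = {}
    for year, records in sorted(by_year.items()):
        trade = [r[0] for r in records]
        result[year] = (
            round(pearson(trade, [r[1] for r in records]), 3),
            round(pearson(trade, [r[2] for r in records]), 3),
        )
    return result


correlations = yearly_correlations(ROWS)
for year, (with_links, with_distance) in correlations.items():
    print(year, with_links, with_distance)
```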

Results

Train on year t and t + 1

Feature importance

Test on t + 2

year RMSE Rsquared MAE
2002 951.04 0.96 166.99
2003 1254.95 0.94 230.47
2004 1019.69 0.95 179.42
2005 1852.54 0.89 310.94
2006 1713.55 0.92 307.53
2007 1974.77 0.90 210.49
2008 1534.67 0.92 248.84
2009 1237.98 0.93 215.63
2010 3165.46 0.63 302.44

Conclusions

  • Interregional trade is difficult to capture
  • Interregional trade leaves a digital trail (digital exhaust)
  • Prediction framework
  • Next steps:
    • Spatially and industrially disaggregated approaches
    • Opportunity for local authorities to estimate their export base / specialisations

Turing PhD project: Giulia Occhini

  • Linking business records with business web data
  • Large state-of-the-art web archives
  • Expected outcome: open data set (and code) with matched business records and archived (recent and older) business website data
  • Research questions: gender, ethnicity and digital divides
  • In collaboration with Levi Wolf

Urban analytics

   
  1. Modelling and simulation
  2. AI and machine learning
  3. Breadth of application
  4. Validation and uncertainty
  5. Dynamics
  6. Visualisation and visual analytics
  7. Data ethics and public engagement
  8. Data platforms

https://www.turing.ac.uk/research/research-programmes/urban-analytics

References

Biau, Gérard. 2012. “Analysis of a Random Forests Model.” Journal of Machine Learning Research 13 (Apr): 1063–95.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.

Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. 2008. “An Empirical Evaluation of Supervised Learning in High Dimensions.” In Proceedings of the 25th International Conference on Machine Learning, 96–103. ICML ’08. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1390156.1390169.

Halavais, Alexander. 2000. “National Borders on the World Wide Web.” New Media & Society 2 (1): 7–28.

Holmberg, Kim. 2010. “Co-Inlinking to a Municipal Web Space: A Webometric and Content Analysis.” Scientometrics 83 (3): 851–62.

Holmberg, Kim, and Mike Thelwall. 2009. “Local Government Web Sites in Finland: A Geographic and Webometric Analysis.” Scientometrics 79 (1): 157–69.

Janc, Krzysztof. 2015. “Geography of Hyperlinks—Spatial Dimensions of Local Government Websites.” European Planning Studies 23 (5): 1019–37.

Jones, Brant W, Ben Spigel, and Edward J Malecki. 2010. “Blog Links as Pipelines to Buzz Elsewhere: The Case of New York Theater Blogs.” Environment and Planning B: Planning and Design 37 (1): 99–111.

Keßler, Carsten. 2017. “Extracting Central Places from the Link Structure in Wikipedia.” Transactions in GIS 21 (3): 488–502.

Krüger, Miriam, Jan Kinne, David Lenz, and Bernd Resch. 2020. “The Digital Layer: How Innovative Firms Relate on the Web.” ZEW-Centre for European Economic Research Discussion Paper, no. 20-003.

Liaw, Andy, Matthew Wiener, and others. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.

Lin, Jia, Alexander Halavais, and Bin Zhang. 2007. “The Blog Network in America: Blogs as Indicators of Relationships Among US Cities.” Connections 27 (2): 15–23.

Salvini, Marco M, and Sara I Fabrikant. 2016. “Spatialization of User-Generated Content to Uncover the Multirelational World City Network.” Environment and Planning B: Planning and Design 43 (1): 228–48.

Vaughan, Liwen. 2004. “Exploring Website Features for Business Information.” Scientometrics 61 (3): 467–77.

Yan, Xiang, Xinyu Liu, and Xilei Zhao. 2020. “Using Machine Learning for Direct Demand Modeling of Ridesourcing Services in Chicago.” Journal of Transport Geography 83: 102661.